{
 "metadata": {
  "name": "",
  "signature": "sha256:3a750e7e1de1c0be6e91e519fea32cef5b8204c55aa76b5642ef4ef7d7dd5803"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Model Assessment and Selection - Part 2\n",
      "##DOF, Information Measures and beyond\n",
      "Scott Hendrickson\n",
      "\n",
      "2015 April 21\n",
      "\n",
      "(With evem more heavy borrowing from ESL!)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "import matplotlib.pyplot as plt \n",
      "%matplotlib inline\n",
      "from IPython.display import Image"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Training Error, Test Error, In-sample Error\n",
      "\n",
      "* We want a good estimate of the prediction error\n",
      "* We have Training Error and Test Error for a particular split of input data (Sample).\n",
      "* Training Error is generally an optimistic estimate of the overall In-sample Error.\n",
      "\n",
      "##Degrees of Freedom\n",
      "\n",
      "Left off Part 1 with with some examples of in-sample and test error estimates as a function of model complexity parameters like the number of nearest neighbors in KNN and $\\lambda$ parameter in the case of the Lasso and Ridge regressions.\n",
      "\n",
      "In many of the examples, we plotted y vs. y predicted as a visualizaiton inteneded to give us some intition in to where and how the model was working.\n",
      "\n",
      "Making this idea more rigorous, we can use this as an effective meausre of model complexity."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "Image(filename=\"Screen Shot 5.png\", width=600)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.neighbors import KNeighborsClassifier\n",
      "from sklearn.neighbors import KNeighborsRegressor\n",
      "from sklearn.linear_model import LinearRegression\n",
      "from sklearn.linear_model import Lasso\n",
      "from sklearn.linear_model import LassoCV\n",
      "from sklearn.linear_model import Ridge\n",
      "from sklearn import cross_validation\n",
      "from sklearn import datasets\n",
      "from sklearn.cross_validation import cross_val_predict\n",
      "from sklearn import linear_model"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# boston house prices data set\n",
      "boston = datasets.load_boston()\n",
      "print boston.keys()\n",
      "#print boston.feature_names\n",
      "print boston.DESCR\n",
      "print boston.target[:10]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Set up many trials on same data set to understand \n",
      "#    test vs. sample error\n",
      "#    variance in estimates\n",
      "n_trials, n_max_neighbors = 50, 100\n",
      "err_train, err_test = np.zeros(n_max_neighbors-2), np.zeros(n_max_neighbors-2)\n",
      "for i in range(n_trials):\n",
      "    X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(boston.data, boston.target, test_size=0.3)\n",
      "    x_complexity, y_train, y_test = [], [], []\n",
      "    for k in range(n_max_neighbors, 2, -1):\n",
      "        clf = KNeighborsRegressor(n_neighbors=k)\n",
      "        clf.fit(X_train,Y_train)\n",
      "        x_complexity.append(k)\n",
      "        y_tmp = clf.score(X_train,Y_train)\n",
      "        y_train.append(y_tmp)\n",
      "        err_train[n_max_neighbors-k] += y_tmp\n",
      "        y_tmp = clf.score(X_test,Y_test)\n",
      "        y_test.append(y_tmp)\n",
      "        err_test[n_max_neighbors-k] += y_tmp\n",
      "    plt.plot(x_complexity, y_test, color=\"blue\", alpha=0.2)\n",
      "    plt.plot(x_complexity, y_train, color=\"red\", alpha=0.2)\n",
      "plt.plot(x_complexity, err_test/n_trials, color=\"blue\")\n",
      "plt.plot(x_complexity, err_train/n_trials, color=\"red\")\n",
      "plt.xlabel(\"Number of Neighbors\")\n",
      "plt.ylabel(\"Fraction Correct\")\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "y = boston.target\n",
      "# cross_val_predict returns an array of the same size as `y` where each entry\n",
      "# is a prediction obtained by cross validation\n",
      "clf = KNeighborsRegressor(n_neighbors=2)\n",
      "predicted = cross_val_predict(clf, boston.data, y, cv=10)\n",
      "fig,ax = plt.subplots()\n",
      "ax.scatter(y, predicted)\n",
      "print np.cov(y,predicted).sum()/(np.std(y)*np.std(y))\n",
      "ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)\n",
      "ax.set_xlabel('Measured')\n",
      "ax.set_ylabel('Predicted')\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "y = boston.target\n",
      "# cross_val_predict returns an array of the same size as `y` where each entry\n",
      "# is a prediction obtained by cross validation\n",
      "k_vec = []\n",
      "c_vec = []\n",
      "n_alpha = 5\n",
      "for i in range(n_alpha):\n",
      "    a = 5*(i+1)/float(n_alpha)\n",
      "    clf = Lasso(alpha=a)\n",
      "    predicted = cross_val_predict(clf, boston.data, y, cv=10)\n",
      "    k_vec.append(a) \n",
      "    c_vec.append(np.cov(y,predicted).sum()/(np.std(y)*np.std(y)))\n",
      "fig,ax = plt.subplots()\n",
      "ax.scatter(k_vec, c_vec)\n",
      "ax.set_xlabel('alpha')\n",
      "ax.set_ylabel('dof(y)')\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "y = boston.target\n",
      "# cross_val_predict returns an array of the same size as `y` where each entry\n",
      "# is a prediction obtained by cross validation\n",
      "k_vec = []\n",
      "c_vec = []\n",
      "n_alpha = 5\n",
      "for k in range(20,2,-1):\n",
      "    clf = KNeighborsRegressor(n_neighbors=k)\n",
      "    predicted = cross_val_predict(clf, boston.data, y, cv=10)\n",
      "    k_vec.append(k) \n",
      "    c_vec.append(np.cov(y,predicted).sum()/(np.std(y)*np.std(y)))\n",
      "fig,ax = plt.subplots()\n",
      "ax.scatter(k_vec, c_vec)\n",
      "ax.set_xlabel('K')\n",
      "ax.set_ylabel('dof(y)')\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Other measures - AIC\n",
      "From wikipedia: offers a relative estimate of the information lost when a given model is used to represent the process that generates the data. In doing so, it deals with the trade-off between the goodness of fit of the model and the complexity of the model."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "Image(filename=\"Selection_008.png\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Author: Olivier Grisel, Gael Varoquaux, Alexandre Gramfort\n",
      "# License: BSD 3 clause\n",
      "\n",
      "import time\n",
      "from sklearn.linear_model import LassoCV, LassoLarsCV, LassoLarsIC\n",
      "\n",
      "diabetes = datasets.load_diabetes()\n",
      "X = diabetes.data\n",
      "y = diabetes.target\n",
      "\n",
      "rng = np.random.RandomState(42)\n",
      "X = np.c_[X, rng.randn(X.shape[0], 14)]  # add some bad features\n",
      "\n",
      "# normalize data as done by Lars to allow for comparison\n",
      "X /= np.sqrt(np.sum(X ** 2, axis=0))\n",
      "\n",
      "##############################################################################\n",
      "# LassoLarsIC: least angle regression with BIC/AIC criterion\n",
      "\n",
      "model_bic = LassoLarsIC(criterion='bic')\n",
      "t1 = time.time()\n",
      "model_bic.fit(X, y)\n",
      "t_bic = time.time() - t1\n",
      "alpha_bic_ = model_bic.alpha_\n",
      "\n",
      "model_aic = LassoLarsIC(criterion='aic')\n",
      "model_aic.fit(X, y)\n",
      "alpha_aic_ = model_aic.alpha_\n",
      "\n",
      "\n",
      "def plot_ic_criterion(model, name, color):\n",
      "    alpha_ = model.alpha_\n",
      "    alphas_ = model.alphas_\n",
      "    criterion_ = model.criterion_\n",
      "    plt.plot(-np.log10(alphas_), criterion_, '--', color=color,\n",
      "             linewidth=3, label='%s criterion' % name)\n",
      "    plt.axvline(-np.log10(alpha_), color=color, linewidth=3,\n",
      "                label='alpha: %s estimate' % name)\n",
      "    plt.xlabel('-log(alpha)')\n",
      "    plt.ylabel('criterion')\n",
      "\n",
      "plt.figure()\n",
      "plot_ic_criterion(model_aic, 'AIC', 'b')\n",
      "plot_ic_criterion(model_bic, 'BIC', 'r')\n",
      "plt.legend()\n",
      "plt.title('Information-criterion for model selection (training time %.3fs)'\n",
      "          % t_bic)\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Also, number of basis fucntions = complexity"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "Image(filename=\"Selection_009.png\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Regression v. Classification"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "Image(filename=\"Selection_007.png\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "iris = datasets.load_iris()\n",
      "X = iris.data[:, 0:2]  # we only take the first two features for visualization\n",
      "y = iris.target\n",
      "#\n",
      "clf = KNeighborsClassifier().fit(X, y)\n",
      "#\n",
      "from matplotlib.colors import ListedColormap\n",
      "# Create color maps\n",
      "cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])\n",
      "cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])\n",
      "#\n",
      "h = .02  # step size in the mesh\n",
      "# Plot the decision boundary. For that, we will assign a color to each\n",
      "# point in the mesh [x_min, m_max]x[y_min, y_max].\n",
      "x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n",
      "y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n",
      "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n",
      "                         np.arange(y_min, y_max, h))\n",
      "Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n",
      "\n",
      "# Put the result into a color plot\n",
      "Z = Z.reshape(xx.shape)\n",
      "plt.figure()\n",
      "plt.pcolormesh(xx, yy, Z, cmap=cmap_light)\n",
      "\n",
      "# Plot also the training points\n",
      "plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)\n",
      "plt.xlim(xx.min(), xx.max())\n",
      "plt.ylim(yy.min(), yy.max())\n",
      "plt.title(\"knn classification\")\n",
      "\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# knn classifier\n",
      "n_trials, n_max_neighbors = 30, 100\n",
      "err_train, err_test = np.zeros(n_max_neighbors-2), np.zeros(n_max_neighbors-2)\n",
      "for k in range(n_trials):\n",
      "    X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, y, test_size=0.3)\n",
      "    x_complexity, y_train, y_test = [], [], []\n",
      "    for i in range(n_max_neighbors, 2, -1):\n",
      "        clf = KNeighborsClassifier(n_neighbors=i)\n",
      "        clf.fit(X_train,Y_train)\n",
      "        x_complexity.append(i)\n",
      "        y_tmp = clf.score(X_train,Y_train)\n",
      "        y_train.append(y_tmp)\n",
      "        err_train[n_max_neighbors-i] += y_tmp\n",
      "        y_tmp = clf.score(X_test,Y_test)\n",
      "        y_test.append(y_tmp)\n",
      "        err_test[n_max_neighbors-i] += y_tmp\n",
      "    plt.plot(x_complexity, y_test, color=\"blue\", alpha=0.2)\n",
      "    plt.plot(x_complexity, y_train, color=\"red\", alpha=0.2)\n",
      "plt.plot(x_complexity, err_test/n_trials, color=\"blue\")\n",
      "plt.plot(x_complexity, err_train/n_trials, color=\"red\")\n",
      "plt.xlabel(\"Number of Neighbors\")\n",
      "plt.ylabel(\"Fraction Correct\")\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "Image(filename=\"Selection_012.png\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#VC Dimension"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "Image(filename=\"Selection_011.png\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "Image(filename=\"10KWV.png\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You choose the points, then the adversary chooses the labeling. Finally you should be able to produce a hypothesis that correctly classifies that labeling of those points. If you are able to succeed for all labelings of the adversary, we say that the VC dimension is at least the number of points you were able to choose. \u2013  Srivatsan Jan 5 '12"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "From Wikipedia: \n",
      "\n",
      "In statistical learning theory, or sometimes computational learning theory, the VC dimension (for Vapnik\u2013Chervonenkis dimension) is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter. It is a core concept in Vapnik\u2013Chervonenkis theory, and was originally defined by Vladimir Vapnik and Alexey Chervonenkis."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Summary\n",
      "\n",
      "Three kinds of models:\n",
      "\n",
      "* Describe what is there. E.g. Summary statistics.\n",
      "* Predict from inputs. E.g. Weather today predicts the weather tomorrow.\n",
      "* Prescriptive models give a Why? A prescription for manipulating outcomes by adjusting causal factors.\n",
      "\n",
      "It is difficult to be prescriptive with a model that doesn't predict well. Therefore, _prediction and understanding errors in prediction are central value to model building._\n",
      "\n",
      "When learning from data examples:\n",
      "\n",
      "* training error is optimistic estimator of sample error\n",
      "* estimate of difference scales as $\\approx 2d/N\\sigma^2$\n",
      "\n",
      "\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}